Abstract

The family name distribution in Korea is investigated and compared with previous
studies of other countries. In Korea, both the family name and its regional origin,
i.e., the place where the ancestor of the family originated, are commonly used to
distinguish one family from another. The family name distributions with and without
the regional origins are analyzed using data sets of various sizes, and compared
with previous studies performed in other countries. The growth rate of each family
is obtained empirically. Contrary to commonly used assumptions, the growth rate is
found to be higher for smaller families.

1 Introduction

Compared with other countries, Korea is known to have a relatively small number
of family names. For example, the nationwide survey in 2000 showed that
the population with the family name Kim is about 10 million, which means
that more than one in five Koreans is a Kim. In general, family names have
been devised to distinguish one family from others. Accordingly, if the number
of family names is too small, or family sizes are too big, the distinction from
other families becomes very inefficient. It is therefore natural that
Koreans have developed an additional sub-system that uses
the regional origin of the family name, i.e., where the ancestor
of the family came from. For example, one of the authors of the present paper
has the family name Kim from Gimhae, a region in southern Korea. The family
name Kim in the 2000 survey was found to have 348 different regional origins.
Almost all Koreans know the regional origins of their family names from
birth, and the state keeps this information in the population registration.

Table 1

Data sets: Korea00 and Korea85 are for the total populations of Korea in the
years 2000 and 1985 [1]. Seongnam98, Osan04, and Hwaseong04 are extracted from
telephone books, and Ajou03 is obtained from the list of registered students in Ajou
University. $N_r$ and $N_f$ are the numbers of family names with and without the
regional origins, and $N$ is the population. $N_r$ is available only for Korea00
and Korea85.

Data set      N             N_f    N_r
Korea00       45,985,289    288    4,188
Korea85       40,419,652    277    3,359
Seongnam98       248,460    161    -
Osan04            19,632    114    -
Ajou03             9,802    109    -
Hwaseong04         3,952     87    -

In Korean culture, having an unfamiliar family name is not common at all,
and most people hesitate to invent new names. In the 15 years from 1985 to
2000 (Table 1), the number $N_f$ of family names in Korea increased only by 11
($11/277 \approx 4\%$ increase). In contrast, the number $N_r$ of family names with
regional origins (e.g., Kim from Gimhae is counted as different from Kim from
Gyeongju) increased in the same period by 829 ($829/3359 \approx 25\%$ increase)
(see Table 1). This suggests that most Koreans consider inventing a
new family name a taboo, while branching out by using new regional origins
(but with the same family name) is totally acceptable in Korean culture.

In this work, we study the distributions of the Korean family names with and
without regional origins. From the above observations, one expects that the
distributions look very different with and without the regional origins, and
that the distribution with regional origins is similar to those in other
countries, where adopting a new family name is not a taboo as it is in Korea.

The data sets analyzed in this work are as follows (Table 1): Korea85 and
Korea00 are for the total population of Korea in the years 1985 and 2000, and
were downloaded from the Korean National Statistical Office [1]. The set Ajou03
is from the list of registered students in Ajou University in 2003, and the
other sets Seongnam98, Osan04, and Hwaseong04 are extracted from telephone
books published in the corresponding cities in the years 1998, 2004, and 2004,
respectively. We have information on the regional origins only in Korea85 and
Korea00. Throughout the paper, $N$ is the size of the population, and $N_r$ and
$N_f$ are the numbers of family names with and without regional origins, respectively.

Fig. 1. The number $N_f$ of family names without regional origins versus the
population $N$. As the population increases, $N_f$ increases logarithmically.
Note that only the horizontal axis is on a log scale. Data are from Table 1 and the
line is only a guide to the eye.



For the $k$th family name (when the family names are arranged in descending
order from the biggest family to the smallest one), $f(k)$ is the number of people
who have that name, leading to $\sum_{k=1}^{N_f} f(k) = N$. We also use the integrated
probability distribution function $P_{\mathrm{int}}(n)$, which measures the proportion
of the families with size not less than $n$.
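
The two quantities just defined can be computed directly from a list of names. The following is a minimal sketch, not the procedure used for the data sets in Table 1; the function names `zipf_sizes` and `p_int` and the toy sample are assumptions introduced only for illustration.

```python
from collections import Counter

def zipf_sizes(names):
    """Return the family sizes f(1) >= f(2) >= ... for a list of family names."""
    return sorted(Counter(names).values(), reverse=True)

def p_int(sizes, n):
    """Fraction of families whose size is not less than n."""
    return sum(1 for s in sizes if s >= n) / len(sizes)

# Toy sample (hypothetical, not taken from the data sets in Table 1).
sample = ["Kim"] * 5 + ["Lee"] * 3 + ["Park"] * 2 + ["Choi"]
f = zipf_sizes(sample)
assert sum(f) == len(sample)  # the sum over k of f(k) equals N
print(f, p_int(f, 1), p_int(f, 3))
```

Since every family has at least one member, `p_int(f, 1)` is 1 for any sample, which matches the normalization used in the text.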



The paper is organized as follows: In Sec. 2, which contains the main results
of the present work, we study various aspects of family name distributions
without regional origins. The relationships among the three basic quantities,
$N_f(N)$, $f(k)$, and $P_{\mathrm{int}}(n)$, are found analytically and compared with
empirical results. The distributions for the various population sizes in Table 1 are
compared with previous results in other countries. The empirically obtained
growth rate as a function of the family size is also discussed. In Sec. 3, we
analyze the distributions of family names with regional origins (e.g.,
Kims from different regional origins are considered to have different names).
Finally, Sec. 4 is devoted to discussions and conclusions.

Fig. 2. (a) Integrated probability function $P_{\mathrm{int}}(n)$ versus the family size
$n$. $P_{\mathrm{int}}(n)$ measures the ratio of the number of families of size not less
than $n$ to the total number of family names. In a broad range of $n$,
$P_{\mathrm{int}}$ is described very well by $P_{\mathrm{int}}(n) \sim -\ln n$. (b) The
population $f(k)$ as a function of the rank $k$ of the family (Zipf's plot). The
biggest family, Kim in this case, has the rank $k = 1$. In a broad range, $f(k)$
shows exponential decay. Both (a) and (b) are obtained from the data set Korea00
(see Table 1).

2 Family Name Distribution 



2.1 Relations: $N_f(N)$, $f(k)$, and $P_{\mathrm{int}}(n)$



We first present in Fig. 1 the number $N_f$ of family names versus the
population. As the population increases, the number of family names is shown
to increase logarithmically. This observation is in sharp contrast to Ref. [2],
where $N_f \sim N^{0.65}$ has been observed in Japan, instead of $N_f \sim \ln N$.
Due to the lack of information, we are not able to study $N_r$ versus $N$.

In Fig. 2(a), the integrated distribution function $P_{\mathrm{int}}(n)$, measuring the
proportion of the family names with populations greater than or equal to $n$, is
displayed. Since every family has at least one person as its member, one gets
$P_{\mathrm{int}}(n = 1) = 1$. As $n$ is increased, $P_{\mathrm{int}}(n)$ is a monotonically
decreasing function of $n$ until $P_{\mathrm{int}} = 0$ is reached for
$n > \max_k f(k) = f(1)$. The logarithmic
decay form in Fig. 2(a) is again very different from the corresponding
results for other countries, where the power-law form
$P_{\mathrm{int}}(n) \sim n^{1-\gamma}$ has been
observed with $\gamma \approx 2$ for the U.S.A. and Brazil [3], and
$\gamma \approx 1.75$ for Japan [2].
The probability distribution function $P(n)$, which is simply the normalized
histogram, is easily obtained from the derivative of $P_{\mathrm{int}}$ with respect to
$n$, since $P_{\mathrm{int}}(n) = \int_n^{\infty} P(n')\,dn'$, i.e.,
$P(n) = -dP_{\mathrm{int}}/dn$. From a practical point of view,
$P_{\mathrm{int}}(n)$ is more convenient than $P(n)$, since practical issues such as the
bin size in $P(n)$ need not be taken care of. As will be clearly shown below,
the same logarithmic form in Figs. 1 and 2(a) is not accidental.

Another convenient and frequently used way of showing the distribution is
the so-called Zipf's plot [4], where the population $f(k)$ of the family name is
plotted in terms of the rank $k$. Figure 2(b) is the Zipf's plot of $f(k)$ obtained
from the data set Korea00 (see Table 1). It clearly shows that $f(k)$ decays
exponentially with $k$, again in contrast to Ref. [2], where a power-law decay
has been observed.

The Zipf's plot of $f(k)$ in Fig. 2(b) and $P_{\mathrm{int}}(n)$ are easily related, since
$N_f P_{\mathrm{int}}(n)$ is simply the number of family names of size not less than $n$:
draw the horizontal line at the vertical position $n$ in Fig. 2(b), and the value of
$k$ at the crossing point is simply $P_{\mathrm{int}}(n)$ multiplied by $N_f$. In other
words, once the functional form $f(k)$ is given, we obtain the relation

$$P_{\mathrm{int}}(n) = f^{-1}(n)/N_f, \tag{1}$$

connecting the Zipf's plot and the integrated probability function^1. The
function $f(k)$ is always a monotonically decreasing function of $k$ by definition,
which guarantees the existence of the inverse function $f^{-1}(n)$. Via Eq. (1), the
logarithmic form of $P_{\mathrm{int}}(n)$ in Fig. 2(a) implies the exponential form of
$f(k)$ in Fig. 2(b) and vice versa.
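
As a worked illustration of Eq. (1), assume a purely exponential Zipf curve $f(k) = A e^{-bk}$. The constants $A$ and $b$ are hypothetical fit parameters introduced here, not values reported in the text. Inverting $f$ gives the logarithmic form directly:

```latex
\begin{align*}
n = f(k) = A e^{-bk}
  &\;\Longrightarrow\; k = f^{-1}(n) = \frac{1}{b}\ln\frac{A}{n}, \\
P_{\mathrm{int}}(n) = \frac{f^{-1}(n)}{N_f}
  &= \frac{\ln A}{b N_f} - \frac{1}{b N_f}\ln n .
\end{align*}
```

Read in the other direction, a form $P_{\mathrm{int}}(n) = a - c \ln n$ inverts to $f(k) \propto \exp[-k/(c N_f)]$, which is the "vice versa" part of the statement.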

We next establish a simple mathematical relation between $P_{\mathrm{int}}(n)$ obtained
for the total population $N_{\mathrm{tot}}$ and $N_f(N)$ obtained for various population
sizes $N$. The only assumption we make is that the set of $N$
($= x N_{\mathrm{tot}}$ with $0 < x < 1$) individuals is taken at random without any
bias^2. In other words, the family at rank $k$ has $f(k)$ members in total, and thus
in the subset of size $x N_{\mathrm{tot}}$ there are $x f(k)$ members of the same name.
Consequently, once $f(k)$ for the total population is given, $N_f$ for the subset of
size $x N_{\mathrm{tot}}$ is simply the number of family names with $x f(k) > 1$, i.e.,
with more than one expected member in the subset, which leads to

$$N_f(x N_{\mathrm{tot}}) = \sum_k \Theta\big(f(k) - 1/x\big), \tag{2}$$

where $\Theta(y)$ is the Heaviside step function with $\Theta(y) = 1$ for $y > 0$ and
$\Theta(y) = 0$ otherwise. From the definition of $f(k)$, the right-hand side of Eq. (2)
equals the value of $k$ at the crossing point of the two curves $f(k)$ and $1/x$,
yielding

$$N_f(N) = f^{-1}(N_{\mathrm{tot}}/N) = N_f(N_{\mathrm{tot}})\,P_{\mathrm{int}}(N_{\mathrm{tot}}/N), \tag{3}$$

where $N = x N_{\mathrm{tot}}$ and the identity in Eq. (1) have been used. Note that the
relation between the Zipf's plot of $f(k)$ and the family name distribution function
in Eq. (1) holds in any case, while the relation between $N_f(N)$ and
$P_{\mathrm{int}}$ holds only when the assumption of unbiased random sampling of the
population is valid. As a specific example, for the power-law form
$P_{\mathrm{int}}(n) \sim n^{1-\gamma}$, $N_f(N) \sim N^{\gamma-1}$ is expected. It is not
clear whether or not this relation has been violated in Ref. [2], where the somewhat
self-contradictory results $P_{\mathrm{int}}(n) \sim n^{-0.75}$ and
$N_f(N) \sim N^{0.65}$ have been concluded from the same unbiased random sampling
of the population. However, if error bars of the proper size are taken into account,
we strongly believe that the discrepancy should vanish.

^1 A similar result, for the case when both $f(k)$ and $P_{\mathrm{int}}(n)$ are of the
power-law form, has been discussed in Ref. [2].

^2 The assumption appears to be very reasonable in modern societies or in big cities,
while it may not be valid for small rural towns where the family name distribution
is far from the one for the whole country.

Fig. 3. (a) The number $N_f$ of family names versus the population $N$, computed
from $P_{\mathrm{int}}$ for Korea00 and Korea85 via Eq. (3). For comparison, the data
points in Fig. 1 are included. (b) The probability $\Pi(k)$ that an individual of the
family name at rank $k$ is selected by the random sampling of the population.
$\Pi(k)$ is a monotonically decreasing function of $k$, implying that people with rare
names are difficult to find in urban cities. The points are obtained by using
Eq. (4) and the values in Table 1 for Seongnam98, Osan04, Ajou03, and Hwaseong04.
The data point marked by a filled square is for Ajou03 and is somewhat special since
it is based not on residential location but on the list of students in a university.
Points denoted by filled circles are obtained from random sampling within Ajou03.
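
The counting in Eq. (2) is easy to check numerically. The sketch below is only an illustration under stated assumptions: it uses a hypothetical exponential Zipf curve $f(k) = A e^{-bk}$ with made-up values of `A`, `b`, and `K` (none fitted to the data sets in Table 1), and the helper name `n_f_subsample` is introduced here.

```python
import math

def n_f_subsample(f, x):
    """Eq. (2): number of names with x * f(k) > 1 in a random fraction x."""
    return sum(1 for size in f if x * size > 1)

# Hypothetical exponential Zipf curve; A, b, and K are illustrative values only.
A, b, K = 1.0e7, 0.05, 280
f = [A * math.exp(-b * k) for k in range(1, K + 1)]

for x in (1e-4, 1e-3, 1e-2):
    # For this f(k), the crossing point is f^{-1}(1/x) = ln(A * x) / b,
    # so the count should track the logarithm of the sampled fraction.
    print(x, n_f_subsample(f, x), math.log(A * x) / b)
```

For this assumed curve, each tenfold increase of the sampled fraction adds about $\ln(10)/b \approx 46$ names, which is the logarithmic growth of $N_f$ with $N$ that Eq. (3) predicts for an exponential $f(k)$.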

In Fig. 3(a), $N_f(N)$ from $P_{\mathrm{int}}$ for Korea00 and Korea85 are displayed
together with the values in Table 1. Although all curves show the logarithmic
dependence, i.e., $N_f \sim \ln N$, $N_f(N)$ computed from $P_{\mathrm{int}}$ by using
Eq. (3) lies systematically higher than the empirical results at any
$N$. This leads us to conclude that the assumption of unbiased random
sampling of the population is not entirely valid for the data sets in Table 1.
Nevertheless, the logarithmic dependence persists for all three curves in Fig. 3(a),
which appears to justify the unbiased sampling assumption as a reasonable
approximation. The substantial difference in Fig. 3(a) can be interpreted as
follows: a significant number of families exist in localized areas. For example, in
some villages in rural areas, most people in the community have the same
family name. Another possibility is that new names originate from foreigners
who recently became Koreans with new Korean names. The latter case can
probably be found only in the biggest city, Seoul. Such geographically non-uniformly
distributed names are captured in the nationwide survey, but
one cannot find them in the telephone books of the cities we investigate in
this work. Accordingly, the actual values of $N_f(N)$ can be different and lie lower
than the values expected from Eq. (3) based on the assumption of unbiased
random sampling.

We elaborate on the above explanation by introducing the probability
$\Pi(k)$ that the name of the $k$th rank is chosen. Unbiased random
sampling corresponds to $\Pi(k) = 1$. The extension of the above derivation of
$N_f(N)$ is straightforward: the number of people with the name of rank $k$ in the
subset of the population is now given by $x\Pi(k)f(k)$, and $N_f(N)$ is the number of
family names satisfying $x\Pi(k)f(k) > 1$. From the same reasoning as before,
$N_f(N)$ with $g(k) = \Pi(k)f(k)$ now reads $N_f(N) = g^{-1}(N_{\mathrm{tot}}/N)$, i.e.,
$\Pi(N_f) f(N_f) = N_{\mathrm{tot}}/N$. Consequently, one
gets the expression for the probability that the name at rank $N_f$ is chosen
as

$$\Pi(N_f) = \frac{N_{\mathrm{tot}}}{N f(N_f)}. \tag{4}$$

We then take the values of $N_f$ and $N$ from Table 1 and
$f(N_f)$ from the Zipf's plot for Korea00 to compute $\Pi(k)$ in Fig. 3(b),
where Seongnam98, Osan04, Ajou03, and Hwaseong04 have been used. The data
point (marked as a filled square) for Ajou03 is somewhat special since Ajou03 is
not based on the residential location of individuals (Table 1). It is interesting
to note that $\Pi(k)$ is a decreasing function of $k$, suggesting that a randomly
chosen individual in urban cities is more likely to have a top-rank name than
one chosen from the whole nation. This appears to be consistent with the expectation
that rare names (with lower ranks) are not distributed uniformly across the whole
country. We also randomly sampled $N$ individuals ($N = 500, 1000, 2000, 4000$)
from Ajou03 and counted how many names ($N_f$) are found in the set. We then
computed $\Pi(N_f)$ in the same way [filled circles in Fig. 3(b)].
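
A minimal sketch of the two steps described above, the evaluation of Eq. (4) and the counting of names in a random subsample, could look as follows. The function names are assumptions introduced here, and the family size `f_at_rank = 5000` is a made-up placeholder, since the value of $f(N_f)$ read from the Zipf's plot is not given in the text.

```python
import random

def selection_probability(n_tot, n, f_at_rank):
    """Eq. (4): Pi(N_f) = N_tot / (N * f(N_f))."""
    return n_tot / (n * f_at_rank)

def names_in_random_sample(names, n, seed=0):
    """Count distinct family names among n individuals drawn without replacement."""
    rng = random.Random(seed)
    return len(set(rng.sample(names, n)))

# N_tot and N are taken from Table 1 (Korea00 and Osan04). The size of the
# family at rank N_f = 114 is NOT given in the text: 5000 is a placeholder.
pi_osan = selection_probability(45_985_289, 19_632, f_at_rank=5000)

# Toy population (hypothetical) to show the subsampling step.
toy = ["Kim"] * 50 + ["Lee"] * 30 + ["Rare"]
print(pi_osan, names_in_random_sample(toy, 10))
```

With unbiased sampling, $f(N_f) = N_{\mathrm{tot}}/N$ and the function returns 1, so values below 1 signal names that are underrepresented in the subsample.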



2.2 Zipf's plots of $f(k)$



Fig. 4. (a) The population $f(k)$ as a function of the rank $k$ of the family (Zipf's
plot) for all data sets (see Table 1). (b) $f(k)/N$ versus $k$ for the same data as in (a).

In this Section, we investigate the distributions of family names for the various
data sets in Table 1. In Fig. 4, all data sets in Table 1 are shown to have
qualitatively the same distribution: Zipf's plots of $f(k)$ for all data sets show
the exponential decay form, implying that the integrated probability distribution
follows the logarithmic form (see discussions in Sec. 2.1). The logarithmic form
found for $P_{\mathrm{int}}(n)$ in every data set then suggests that the probability
distribution $P(n)$, which measures how many family names have the size $n$, follows
the form

$$P(n) \sim n^{-\gamma} \tag{5}$$

with $\gamma \approx 1$ in Korea, in sharp contrast to other countries where
$\gamma > 1.5$ has been concluded. This rather unique family name distribution in Korea
can be understood from the birth-death model in Ref. [3], where it has been
shown that $P(n) \sim n^{-1}$ occurs in a stationary state if the total population
does not grow (or the death rate $\mu = 1$) and the rate $\alpha$ of new name
generation is very small. The latter condition appears to be fulfilled in Korea since
only a very small number of new names were generated between 1985 and 2000
(see Table 1). Furthermore, family names were introduced
to Korean society a very long time ago (at least one thousand years ago), and
therefore one can assume that the Korean family name distribution is very
close to the one at the stationary state although the population is still growing.
Although new name generation is also very rare in Japan, as in Korea,
family names in Japan, on the contrary, have a short history (most names were
created about 120 years ago [2]). The qualitatively different family name
distribution in Japan appears to imply that the distribution is still far from its
stationary state, as pointed out in Ref. [2]. From this perspective, it is very
plausible that, as time proceeds, the family name distributions in other
countries should also approach $P(n) \sim n^{-1}$ eventually, provided that the
rate of new name generation becomes sufficiently small.
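
The step from a logarithmic $P_{\mathrm{int}}(n)$ to $\gamma = 1$ follows from $P(n) = -dP_{\mathrm{int}}/dn$. As a short worked version, assume the fit form $P_{\mathrm{int}}(n) = a - c \ln n$, where the constants $a$ and $c$ are introduced here only for illustration:

```latex
\begin{align*}
P_{\mathrm{int}}(n) &= a - c\ln n, \\
P(n) &= -\frac{dP_{\mathrm{int}}}{dn} = \frac{c}{n} \;\sim\; n^{-1},
\end{align*}
```

so Eq. (5) holds with $\gamma = 1$. Equivalently, the power-law relation $P_{\mathrm{int}}(n) \sim n^{1-\gamma}$ degenerates into a logarithm in the limit $\gamma \to 1$.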

Fig. 5. The population growth rate $r$ as a function of (a) the rank $k$ of the family
and (b) the size $n$ of the family. $r$ is obtained from Korea85 and Korea00 as
$r = [f(k)^{2000} - f(k)^{1985}]/f(k)^{1985}$, with $f(k)^{\mathrm{year}}$ being the size
of the family of rank $k$ in a given year. The sizes of a number of families are found
to decrease, giving negative values of $r$. In this case, we plot $-r$ instead and
denote those families by filled squares. The growth rate of the total population,
$r_{\mathrm{tot}} = 0.1377$, is shown (full line) for comparison.

2.3 Growth rate of family size 



When family names were first introduced in human history, it is expected that
the larger the family, the more probable the family's survival, since the size
of the family was the measure of its strength in many respects, such as
labor power and the number of warriors. In other words, the population
growth rate of a family must have been an increasing function of the size of
the family a long time ago. However, in a modern society, the population growth
rate is not necessarily an increasing function of the family size since the size
is no longer a matter of life and death.

We use the two data sets Korea85 and Korea00 to compute the growth rate of
the family size for each family name over 15 years. The growth rate of the total
population is 0.1377 during this period, as computed from Table 1. It is found that
no name actually disappeared between 1985 and 2000, but 11 new names were
created. In Fig. 5, we plot the growth rate $r$ as a function of (a) the rank $k$
and (b) the size $n$ of the family. There are several very interesting features.
The growth rate of some families is huge. For example, one family at a very low
rank had only one individual as its member in 1985 while the family size
became 462 in 2000, giving a growth rate of 461 in 15 years. This can never
be explained by any biological reason (no one can have several hundred
children in 15 years), but rather must be a reflection of some sociological
or psychological forces. Relatively high growth rates are mostly observed in
families at lower ranks, and $r(k)$ in Fig. 5(a) is roughly an increasing function
of $k$. Such a nonuniform growth rate can be explained by the assumptions
that (1) the family name with a high growth rate was recently invented and
that (2) a family of smaller size tries harder to increase the number of its
members by, e.g., recruiting new members from relatives who still keep the old
name. The above assumptions look reasonable since most people, especially
those who have small family sizes, probably want to see their names flourishing, not
disappearing in the future.
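
The two growth rates quoted above can be reproduced from numbers given in the text. The sketch below assumes the definition $r = (f^{2000} - f^{1985})/f^{1985}$ from the caption of Fig. 5; the function name `growth_rate` is introduced here for illustration.

```python
def growth_rate(size_1985, size_2000):
    """Relative growth r = (f_2000 - f_1985) / f_1985 over the 15 years."""
    return (size_2000 - size_1985) / size_1985

# Total populations from Table 1 (Korea85 and Korea00).
r_tot = growth_rate(40_419_652, 45_985_289)
# The extreme case quoted in the text: 1 member in 1985, 462 in 2000.
r_small = growth_rate(1, 462)
print(round(r_tot, 4), r_small)
```

A family that shrinks gives a negative value, which is the case plotted as $-r$ with filled squares in Fig. 5.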

In Fig. 5(b), we display the growth rate $r(n)$ as a function of the family
size $n$. As expected from Fig. 5(a), $r(n)$ is a roughly decreasing function of $n$,
saturating towards the total growth rate $r_{\mathrm{tot}}$. It is interesting to see that
$r(n)$ is significantly higher than $r_{\mathrm{tot}}$ when $n < n_c$, where $n_c$ is a
power of ten [exponent illegible]. The size $n_c$ can be interpreted
as the size scale beyond which the nonbiological growth is dominated by the
biological growth. One can also associate $n_c$ with a psychological turning point
in people's minds separating small and big families.



3 Family Name With Regional Origin 

As an additional way to distinguish families, the regional origins
of families are used together with the family names in Korea. In this Section, we
regard family names with different regional origins as distinct names, e.g., Kim
from Gimhae and Kim from Gyeongju are considered to be different names,
and study various aspects of the distribution.

Figure 6 shows the integrated probability function $P_{\mathrm{int}}(n)$ measuring the
proportion of families which have at least $n$ family members. The distribution
in Fig. 6, with the broad intermediate range described by power-law
behavior, is very different from the corresponding plot in Fig. 2. In other
words, the family name distributions are completely different with and without
the regional origins. Furthermore, although the distribution of the family
names without information on regional origins in Korea is very different from
other countries, if the regional origins are used to distinguish one family from
others, the distribution behaves qualitatively like that in
other countries. More specifically, the power-law behavior $P(n) \sim n^{-\gamma}$
[i.e., $P_{\mathrm{int}}(n) \sim n^{1-\gamma}$] is observed in a broad range of $n$.

Figure 7 is for the subsets of (a) Korea00 and (b) Korea85. For example,
Kim00 is the subset of Korea00 that contains only the name Kim, with its
different regional origins. The three biggest families, Kim, Lee, and Park, have
348, 283, and 159 different regional origins in Korea00, and 281, 244, and 127
in Korea85, implying that new regional origins have been branching out quite
rapidly within a family. All curves in Fig. 7 have power-law decay
behaviors similar to those in Fig. 6.

Fig. 6. Integrated probability function $P_{\mathrm{int}}(n)$ versus the family size $n$
when names with different regional origins are taken as different, for data sets
(a) Korea00 and (b) Korea85.

Fig. 7. $P_{\mathrm{int}}(n)$ for the subsets with the family names Kim, Lee, and Park
extracted from (a) Korea00 and (b) Korea85. Kims with different regional origins are
considered to be different. The power-law decay of $P_{\mathrm{int}}(n)$ is seen in every
subset.



We finally investigate in Fig. 8 the growth rate $r(n)$ for the family names with
regional origins as a function of the family size $n$ for (a) the whole data set
Korea, and for the three major family names (b) Kim, (c) Lee, and (d) Park. All
show qualitatively the same behaviors as in Fig. 5 in Sec. 2.3, i.e., a very high
growth rate for small families, and saturation to the average growth rate beyond
a family size of around a power of ten [exponent illegible]. It is interesting to note
that, as far as the growth rate $r(n)$ is
concerned, the behavior does not depend much on whether the regional origins
are taken into account or not. This implies that a family with a very common
family name but with a very rare regional origin still considers itself small
and thus makes tremendous nonbiological efforts to increase its membership.

Fig. 8. Growth rate $r(n)$ as a function of family size $n$ for (a) Korea, (b) Kim,
(c) Lee, (d) Park. Identical family names with different regional origins are
considered as different.



4 Conclusion 



In this paper, we have studied the distributions of Korean family names
extracted from various sources: nationwide surveys in 1985 and 2000,
telephone books published in three cities, and the list of registered
students in a university. Family names with and without regional origins have been
found to show quite different distributions: the integrated probability
distribution is logarithmic without regional origins (Kim from Gimhae and Kim from
Gyeongju are regarded as the same name), while it shows power-law behavior
with regional origins (Kim from Gimhae is considered a different name
from Kim from Gyeongju). The difference is also reflected in the Zipf's plot
of the family size $f(k)$ versus the family rank $k$: exponential versus power-law.

The relation between $P_{\mathrm{int}}(n)$ and $f(k)$ has been established, and $N_f(N)$
(how many family names are found in a population of size $N$) has been shown to have
a simple relation with $f(k)$ [and thus with $P_{\mathrm{int}}(n)$] under the assumption
of unbiased random sampling. The empirical observations show systematic deviations
from the expected values, which were then used to compute the probability
$\Pi(k)$ that the family at rank $k$ is selected if we pick one individual at
random. Interestingly, $\Pi(k)$ has been found to be a decreasing function of $k$,
implying that small family names are hard to find in general.

The growth rate $r(n)$ for the family of size $n$ has been computed from empirical
data (with and without regional origins). All show very interesting behaviors:
a huge growth rate [a power of ten, exponent illegible, in 15 years] for small
families, implying nonbiological growth, and saturation towards the average growth
rate starting from around $n_c$ [a power of ten, exponent illegible], which has been
interpreted as the sociological/psychological separation
point between big and small families.